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Abstract 

An evolutionary tree is a cascade of bifurcations starting from a single 
common root, generating a growing set of daughter species as time goes 
by. Species here is a general denomination for biological species, spoken 
languages or any other entity evolving through heredity. From the N cur- 
rently alive species within a clade, distances are measured through pair- 
wise comparisons made by geneticists, linguists, etc. The larger is such a 
distance for a pair of species, the older is their last common ancestor. The 
aim is to reconstruct the past unknown bifurcations, i.e. the whole clade, 
from the knowledge of the N(N — 1)/2 quoted distances taken for granted. 
A mechanical method is presented, and its applicability discussed. 

PACS: 02.10.Ud; 89.75.Hc; 87.23.Kg. 



Natural experiments pQ are those where the experimenter cannot manipulate 
the object of study. Only comparisons can be made. It is a recent field of re- 
search allowing quantitative studies of historical evolutions. Figure 1 exemplifies 
a clade. At left, the traditional cladogram showing the successive speciations. 
This kind of draw is familiar to geneticists, linguists, etc. A good historical de- 
scription entitled Trees before and after Darwin was recently published [2] ■ At 
right, on Figure 1, we add the not so familiar concept of ultrametric distances 
on which our analysis is based. A direct measure of such a distance demands 
scarce fossil data. However, researchers perform indirect measurements of such 
pairwise distances by comparing features of currently alive species. How to per- 
form these measurements is a vast field of research, out of the current scope. 
A good review can be found in [3] and references therein. Human evolution 
can be traced back with genetic or linguistic measurements 0]. A further, still 
incomplete list of works is shown in [5] for genetics, and [6] for linguistics. Here, 
we simply suppose the pairwise distances of a given clade were measured among 
its N current alive species, i.e. a set of N(N — l)/2 positive numbers taken for 
granted. The purpose is to reconstruct the whole tree from these data. 
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Figure 1: A clade: Closed circles represent the 6 known currently alive species, 
open circles their past ancestors. The ultrametric distance between two alive 
species is the time counted from today back to their last common ancestor: 15 
distances Dy are shown in the spectrum at right. The uppermost level is 8- fold 
degenerate, i.e. the same distance appears 8 times. It corresponds to the root, 
the single original species. The second level is also 4-fold degenerate. The whole 
clade (family) is divided in two sub-clades (genera), the 4 rightmost currently 
alive species belong to one genus, the 2 leftmost to a second genus. 

In short, by knowing the distances exemplified at right in Figure 1, the 
problem is to draw the corresponding tree at left. A mechanical solution follows, 
each currently alive species is associated to a unitary mass particle moving along 
a X axis. Particles are transparent, they can pass through each other. All N 
particles are initially released at the origin x — with random velocities (zero 
sum, keeping the center of mass at rest). Particles interact through the energy 

E=l^Jij(xi-x j ) 3 (1) 
summed over all pairs (i,j = 1,2,... N), with coupling constants given by 

Jii = D — Dij (2) 

where D is an adjustable parameter (we will get rid of it soon). are the 
quoted distances. The movement follows Newton's law, the accelerations 

X% = ^ ^ Jij{xi Xj) (3) 
3 

form a set of N linear, second order differential equations which can be solved 
by diagonalizing its corresponding N x N secular matrix. Before that, let's 
foresee the movement. 
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Take D in between the two uppermost levels, Figure 1. If alive species i and 
j belong to the same genus, the coupling constant Jij is positive (attraction). 
Otherwise, Jy is negative (repulsion). The two genera repel each other, while 
attraction holds inside each genus. Therefore, n (N — n) particles belonging to 
one (the other) genus remain clustered running away towards one (the other) 
sense along the X axis. The eventual partition defines two genera. The same 
process is repeated within each just discovered genus, and so on, reconstructing 
the whole clade. 

The secular matrix of Equation (3) can be divided as 



S - ND I + D G 

where I is the identity and G is a N x N matrix with all entries Gij 
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1. Matrix 
(5) 
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does not depend on (hereafter discarded) D, only on the measured distances 
Dij. Their N eigenvectors completely define the movement. Among them, two 
deserve particular comments, two next paragraphs. 

The Goldstone eigenvector has N unitary entries (1,1,1. ..1,1), whose eigen- 
value is always null. Interaction energy (1) presents only internal forces between 
the N particles themselves, thus the global center of mass remains at rest at 
the origin, i.e. x\ + x-i + x 3 . . . x^-i + xn = 0. Furthermore, any other eigen- 
vector (ai, 02, 03... ajv-i 5 o-n), is orthogonal to this always-present Goldstone, 
i.e. a\ + C12 + 03 . . . (Xjv-i + a N = 0. In other words, any eigenvector besides the 
Goldstone is a series of positive and negative entries with zero sum. Matrix G 
nullifies all these further eigenvectors. 




Figure 2: Some analytically solved trees. 
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Among them, the eigenvector with highest eigenvalue presents n positive 
entries for one genus, N — n negative entries for the other genus, thus solving 
our problem. Let's call it the partition eigenvector. For the simple clade shown 
at left in Figure 2, for instance, the partition eigenvector is (2, — 1, —1). For the 
largest clade at right, it is (N — n, N — n, N — n . . . — n, —n). 

The other N — 2 eigenvectors are unimportant, governing only the inter- 
nal movement within each already-separated genus. In short, given some 
dataset, the only task is to compose matrix (5), finding its eigenvector with 
highest eigenvalue. The signs of its entries describe the correct partition. 

However, measured distances suffer from drift fluctuations. Therefore, they 
do not reflect the exact degeneracies. Each degenerate level becomes a band of 
neighboring levels, no longer degenerate. In Figure 1, the highest and second 
highest levels would be represented by two bands with 8 and 4 separated levels, 
respectively. 

Once a given bifurcation was successfully reconstructed, the corresponding 
broken degeneracy can be restored as follows. Let's ni and rii be the number of 
species in each branch of the quoted bifurcation. Then, the n\rii corresponding 
distances displayed in the measured dataset represent indeed different measure- 
ments of the same single value, namely the real ultrametric distance. It is, 
then, better represented by the average over measurements: one replaces all 
n\ni distances in the original dataset by their average, restoring the cor- 
responding degeneracy. (Besides the average, dispersion serves to estimate the 
age uncertainty.) 

The band widths increase with the total evolutionary time, due to accumu- 
lated random drift. If the clade under study is too old, these bands tend to 
overlap over each other, and the model may fail beyond some degree of random- 
ness, as any other method. How is it robust against these fluctuations? 

Hereafter, we analyze the method performance in these real situations. Our 
strategy is simple: to test the method with clades for which one knows the 
entire past history since the first bifurcation. We construct these clades in a 
computer, following two ingredients [7J. First, one starts from a single species. 
With a small fixed probability b, at each new time step the species can bifurcate. 
After that, each of the two emerging species evolves independent of the other. 
New branches may also bifurcate. One can book the exact times when each 
bifurcation occurs. An example of such a tree is shown in Figure 3. 

Second ingredient, each species' internal characteristics are represented by a 
sequence of L bits I or 0. At each time step, this bitstring is mutated, i.e. an 
average number m of its bits are randomly chosen and inverted from to I or vice 
versa. When some species bifurcates, its current bitstring is copied to each new 
branch, both suffering independent random mutations thereafter. The scaling 
between m and L sets the maximum evolutionary time (time-back horizon) one 
can hope to reconstruct with the available accuracy. We use different seeds for 
random number generators governing the bifurcations (i?l) or mutations (i?2). 
Keeping the same Rl for different computer runs with different values of to, one 
can test the same tree topology under different mutation rates. 
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Figure 3: Computer generated evolutionary tree. Starting from a single species, 
a first bifurcation occurs at t — 0. A discrete clock t = 1,2,3... runs downwards. 
Each branch (species) can bifurcate with probability b = 0.0005 at any time. 
When a bifurcation occurs, the number inside the corresponding square shows 
the quantity of alive species below it at t = 8, 000 (today, not shown) when one 
counts a total of N = 109 alive species in this particular realization. Bifurcations 
occurred beyond t — 5, 000 are not shown for clarity. 

The distance between two bitstrings is the number of unmatched bits di- 
vided by L/2 (the random expected value) for normalization. Some similar 
normalization procedures are followed by linguists in order to cancel out pho- 
netic accidental coincidences [8]. The maximum distance should be 1. However, 
due to fluctuations, old clades may present some distances slightly larger than 
1, say a percentage P%. They are of course statistically meaningless data. A 
nearly equal quantity below 1 is also supposed to be meaningless. Thus, 2P% 
is a first, crude estimate for the dataset degree of distrust. Another approach 
would be the analysis of never-touched-bits, how this set still holding the original 
information shrinks as evolutionary time goes by [9]. 

Many clades like that exemplified in Figure 3 were tested, the result is indeed 
the expected one: perfect reconstruction up to a certain degree of randomness. 
The general behavior can be appreciated in Figure 4, showing (the top part of) 
the eigenvalue spectrum of matrix (5) — not to be confounded with the spectrum 
of distances. At left, the reconstruction is perfect, while the highest eigenvalue 
is still separated from the band below it, thus corresponding to the correct 
partition eigenvector. Note also the saturation of the whole spectrum near 
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the largest conceivable eigenvalue N, for increasing randomness. The correct 
partition is obtained up to m = 4.1, where approximately 2P — 68% of the whole 
dataset is statistically meaningless. For m = 5.1 with 2P = 77%, the partition 
eigenvalue is surpassed by some other competitor, and the reconstruction fails. 
As a rule-of-thumb for real D^- datasets, they can be safely considered trustable 
if: I) The highest eigenvalue of matrix (5) is separated from the others; and II) 
At least the second largest eigenvalue is smaller than the upper limit N. 
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Figure 4: Six topmost secular eigenvalues with increasing fluctuation (to =1.1, 
2.1, 3.1, 4.1 and 5.1 from left to right, L = 16,384), for the same tree in 
Figure 3. Other 103 smaller eigenvalues are not shown (indicated by dots). 
For increasing fluctuation, the whole spectrum shrinks and saturates near the 
maximum conceivable value N = 109 (except the always-present null Goldstonc 
eigenvalue). Thus, in the limit of large fluctuations, the topmost eigenvalue 
becomes no longer isolated from the band below it, the genera partition may 
become wrong. In the current case this occurs at right, for m = 5.1, when the 
2, 464 distances corresponding to the highest level in Figure 3 form a wide band 
with only 16% of the remainder 3, 422 data below it, as they all should be. 

In hard cases like the rightmost in Figure 4, the remote past of the first 
bifurcations is inaccessible with the accuracy at disposal. The measured dataset 
presents too much statistically meaningless entries (distances near 1). They can 
be gradually expunged as follows. All N(N — l)/2 links between species are 
initially present, forming a completely connected network with N vertices. One 
cuts the link corresponding to the largest distance Dij, then the second largest, 
and so on. At some point along this sequence, the network becomes disconnected 
in two pieces. The very old history about this separation is inaccessible. But 
the recent history is not lost. The secular matrix method is then applied to each 
piece separately (expunged internal links re-included). If the above criteria I 
and II are not fulfilled for one piece, the cut-link procedure continues within 
it. Following this strategy for the tree in Figure 3 with m — 5.1, the isolated 
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species T is first disconnected from the other 108. Then, also isolated species 
S disconnects, followed by A and by C. The set of 105 remainder species still 
does not fulfill the correctness criteria I and II. Next, the block BH with 10 
species disconnects, and can be successfully reconstructed by the method. The 
remainder 95 species do not. Continuing the cuttings, block UK disconnects, 
then P, then FQR, then G and L, all successfully reconstructed, and so on. 

After each successful step where a correct bifurcation partition is found (un- 
der criteria I and II), the corresponding degeneracy is restored, changing the 
original dataset. Thus, one can re-start the whole process using the new dataset. 
Degeneracies are, thus, hierarchically and gradually restored. They can also be 
partially restored, by combining already defined blocks (groups among A, B, C 
. . . R, S, T). After many runs, in case the complete reconstruction still fails, one 
can observe which blocks are most responsible for failures (normally isolated 
species or small sub-clades), and remove them from the dataset. Within the 
hard case m = 5.1 in Figure 3, by removing isolated species A, C, S and T, 
and also blocks H, L and PQR, the remainder main tree with N = 78 species 
could be correctly reconstructed. Isolated, each removed block is also correctly 
reconstructed, but one cannot know where or when it should be branched from 
the main tree, because this occurred before the time-back horizon at disposal. 

There are a lot of alternative methods to reconstruct evolutionary trees 
[31 [TU] ■ None of them considers the degeneracies appearing in the spectrum of 
real, ultramctric distances (past time since the last common ancestor), neither 
the breaking of these same degeneracies due to evolutionary drift. This feature 
distinguishes the present method, the quoted degeneracies are gradually and 
hierarchically restored. 

Also, the model dynamics considers all N(N — l)/2 distances at each step, 
mitigating the effects of the statistically meaningless part of the data. This 
feature is absent from traditional methods as the pioneering UPGMA, where 
the pair of species corresponding to the smallest distance is joined into a single 
species, whose distance to each remainder species is the average between both 
former distances. Instead of simply choosing the smallest distance, minimum- 
evolution approaches minimize quantities involving all distances, like our method, 
so improving the performance. Nevertheless, being yet neighbor-joining recipes, 
the number of distances is always reduced by N — 1 at each step. Reference 
[TO] presents such a method and comparisons with others. The best perfor- 
mances are equivalent to ours. Indeed, in many tests, whenever the correct 
partition is obtained by the method in [10j . it is also obtained by the current 
one. The tree in Figure 3, for instance, is correctly reconstructed up to m — 4.1 
by both methods, but both fail under a little bit larger degree of randomness, 
m = 4.2. This coincidence indicates that reconstruction correctness is limited 
only by fluctuations in the measured dataset, not by drawbacks of the methods 
themselves. Moreover, when both fail, two different (wrong) partitions were 
observed. A posteriori, their comparison serves as a further, final criterion for 
reconstruction correctness. 

The author is indebted to S0ren Wichmann and Jose Soares de Andrade Jr. 
for critical readings of the manuscript and helpful suggestions. 
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